Abstract

In adaptive testing, item selection is sequentially optimized during the test. Since the 
optimization takes place over a pool of items calibrated with estimation error, capitalization 
on these errors is likely to occur. How serious the consequences of this phenomenon are 
depends not only on the distribution of the estimation errors in the pool or the ratio of the test 
length to the pool size, but also on the structure of the item selection criterion used. A 
simulation study demonstrated the existence of the phenomenon empirically. It also showed 
that its effect on the errors in the ability estimates interacts strongly with the distribution of 
the items in the pool. 

Capitalization on Item Calibration Error in Adaptive Testing

The ideal underlying computerized adaptive testing (CAT) is to adapt the properties of 
the test items optimally to the ability of the examinee. The proper framework to realize this 
goal is item response theory (IRT). The unique feature of IRT models is that they have 
separate parameters to represent the properties of the items and the ability of the examinee. 
As a consequence, these models can be used to select items such that an optimal match is 
obtained between (a function of) the values of the item parameters and the value of the ability 
parameter. Since the value of the ability parameter is not known, the test begins with an a 
priori estimate of the value of the ability parameter that is updated after each new item 
response. The values of the item parameters are estimated in advance; during the test these 
estimates are usually treated as if they are the true values of the parameters. A more complete 
description of adaptive testing is given in Wainer (1990). 

One of the functions of the item parameters often used in adaptive testing is Fisher’s 
information function (Hambleton & Swaminathan, 1985, chap. 6; Lord, 1980, chap. 5). This 
function not only has the advantage of being monotonically related to the (asymptotic) 
standard error of the ML estimator of the ability parameter but is also additive in the items. 
Use of the function is generally accompanied by the application of the maximum-information 
criterion of item selection which selects the next item to have maximum information at the 
current estimate of the value of the ability parameter. If the value of the ability parameter is 
estimated in a Bayesian fashion, that is, by its posterior distribution given the responses on 
the previous items, other functions of the item parameter values are used. A well-known 
example of these functions is the expected reduction of the posterior variance. In Bayesian 
adaptive testing, the next item is selected to minimize this function. A more complete 
description of these item selection criteria is given below. 

Application of an item selection criterion over a pool of items for a given examinee 
always involves optimization, that is, the choice of the next item for which the criterion has a 
maximum or minimum value. However, since the values for the item parameters are 
estimated, a process known as capitalization on chance may occur. The process operates on 
the fact that extreme values of a function of the item parameters can be the result of extreme 
true values of the parameters as well as large estimation errors. Consequently, if items are 
selected optimizing the value of this function, large estimation errors tend to be 
overrepresented among the items selected. The result is an ability estimator with an accuracy 
likely to be worse than expected. 

In test theory, the phenomenon of capitalization on chance has been well addressed for 
the problem of choosing a battery of variables with the largest predictive validity for job 
performance or academic success in a selection problem. The measure usually taken to 
counter its effects is to split the sample into a screening and a calibration sample. The 
variables are then selected in the screening sample but their regression parameters are re- 
estimated in the calibration sample (Lord & Novick, 1968, chap. 13). The effect of this cross 
validation is a shrinkage of the initial estimates of the regression parameters to more realistic 
sizes. 

The problem of capitalization on chance was not addressed in the literature on test 
assembly until recently in papers by Hambleton and associates (Hambleton & Jones, 1994; 
Hambleton, Jones & Rogers, 1993). These authors show that if test forms are assembled to 
have maximum information over an ability interval and the values of the item parameters are 
estimated from a sample of N=400, the height of the information function may be 
overestimated by as much as 25-40%. Samples of this size are not uncommon in educational 
testing. 

Several factors can be expected to have an impact on the process of capitalization on 
calibration error. The first is the distribution of the errors in the estimated parameter values 
in the item pool. Obviously, the larger the errors (or the smaller the calibration sample), the 
larger the effects of the capitalization on the values of the criterion. The second is the ratio of 
the number of items selected to the number in the pool. The smaller the ratio, the larger the 
likelihood of selecting items only from those with the larger estimation errors. The roles of 
both factors were confirmed in the studies by Hambleton et al. 
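
Both factors can be illustrated with a small Monte Carlo sketch. It is not the design used by Hambleton et al.; the normal distributions for the true discriminations and the calibration errors, the function name `information_overestimate`, and the use of the summed squared discrimination as a stand-in for test information (for a logistic item the maximum information is proportional to the squared discrimination) are all assumptions made for illustration.

```python
import random


def information_overestimate(pool_size, n_select, error_sd, n_rep=200, seed=1):
    """Average ratio of estimated to true summed a^2 over the selected items.

    All distributional settings are illustrative assumptions, not values
    taken from the studies discussed in the text.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_rep):
        # Hypothetical true discriminations, kept away from zero.
        true_a = [max(0.2, rng.gauss(0.8, 0.3)) for _ in range(pool_size)]
        # Estimate = true value + independent normal calibration error.
        est_a = [max(0.05, a + rng.gauss(0.0, error_sd)) for a in true_a]
        # "Optimal" selection on the estimates: largest estimated a first.
        order = sorted(range(pool_size), key=lambda i: est_a[i], reverse=True)
        chosen = order[:n_select]
        est_info = sum(est_a[i] ** 2 for i in chosen)
        true_info = sum(true_a[i] ** 2 for i in chosen)
        ratios.append(est_info / true_info)
    return sum(ratios) / n_rep


if __name__ == "__main__":
    for pool_size in (40, 400):
        for error_sd in (0.05, 0.2):
            ratio = information_overestimate(pool_size, 20, error_sd)
            print(pool_size, error_sd, round(ratio, 3))
```

A ratio above 1 means the selected items look more informative on their estimates than they are on their true values. Under these assumptions the ratio should grow with the error standard deviation and with a smaller selection ratio (20 out of 400 rather than 20 out of 40).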

The authors of this paper had no strong prior opinion as to whether the 
effects of capitalization on error in CAT would be more or less serious than in the assembly 
of test forms with a fixed format. The size of the estimation errors and the selection ratio were 
certainly expected to remain important factors but the role of two new factors was unclear. 
The first new factor is the structure of the function of the item parameters used in the item 
selection criterion. As shown in an analysis below, item selection criteria are certainly 
sensitive to estimation error. On the other hand, it is known that for CATs of realistic length the 
ability estimator is quite robust with respect to the choice of the item selection criterion 
(Chang & Ying, 1996; Veerkamp & Berger, 1997; van der Linden, 1998; van der Linden & 
Reese, 1998). The same may thus hold with respect to variation in the criterion values due to 
estimation error. The second factor deals with the question of how the effects of early 
capitalization on errors in a CAT propagate later on in the test. In another context, it has been 
found that early bias in the ML ability estimator in a CAT tends to be neutralized by the 
maximum-information criterion later in the process (van der Linden, 1998). However, not 
much is known with respect to the effects of errors in the estimated values of the item 
parameters. 

From a practical point of view, errors due to capitalization on chance in CAT are 
much more serious than in the assembly of forms for paper-and-pencil testing. All items are 
selected in real time and the estimates of their parameter values are used immediately to find 
the next "optimal" item. In adaptive testing, cross validation of item selection is impossible. 

The remainder of this paper is organized as follows: First, the item selection criteria 
used in this study are introduced and analyzed for their liability to errors in item parameter 
estimation. Then, the design of the simulation study is discussed. The last section of the paper 
presents the results from the simulation study and draws some practical conclusions. 

Item Selection Criteria and Estimation Error 



As already indicated, the effects of capitalization on calibration error in CAT depend 
not only on the size of the calibration errors but also on the function defined on the item 
parameters optimized. One of the functions in use for CAT is Fisher’s information function. 
For dichotomously scored items, the function has the following form: 



\[
I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)\,Q_i(\theta)}, \tag{1}
\]

$P_i(\theta)$ being the response function for item $i$, $P_i'(\theta)$ its first derivative with respect to $\theta$, and 
$Q_i(\theta) \equiv 1 - P_i(\theta)$ (Lord, 1980, sect. 5.4). In CAT, the function is used to find the item in the 
pool that yields the largest value at $\theta = \hat{\theta}$, where $\hat{\theta}$ is the current estimate of the ability of the 
examinee. 

For the two-parameter logistic (2-PL) model 

\[
P_i(\theta) \equiv \{1 + \exp[-a_i(\theta - b_i)]\}^{-1}, \tag{2}
\]

with $a_i$ and $b_i$ being the discrimination and difficulty parameter of item $i$, respectively, the 
information function is equal to 

\[
I_i(\theta) = a_i^2\,P_i(\theta)\,Q_i(\theta). \tag{3}
\]

Analytically, for a fixed value of $a_i$ the function in (3) reaches a maximum for $\theta = b_i$, that is, 
for the $\theta$ value that gives $P_i(\theta) = .50$. At this point the maximum is equal to $.25\,a_i^2$. Thus, a CAT 
algorithm based on the maximum-information criterion will have a tendency to select items 
from the pool with values of $b_i$ close to $\theta$ and large values for $a_i$. 

The critical factor in (3) is the size of the discrimination parameter $a_i$ rather than the 
factor $P_i(\theta)Q_i(\theta)$. Because the parameter is squared in (3), the effect of estimation errors is 
enlarged. On the other hand, the factor $P_i(\theta)Q_i(\theta)$ in (3) is quite robust with respect to values 
for $b_i$ in the neighborhood of the $\theta$ value of the examinee, even for larger values of $a_i$. If the 
value of $P_i(\theta)$ is in the range of [.40, .60], the maximal difference between the product 
$P_i(\theta)Q_i(\theta)$ and its maximum value is .01. If the range is enlarged to [.30, .70], the difference is 
still not larger than .04. Thus, a CAT algorithm based on the maximum-information criterion 
can be expected to capitalize on large errors in $a_i$ but to be relatively robust with respect to 
errors in $b_i$. 

If the three-parameter logistic (3-PL) model 

\[
P_i(\theta) = c_i + (1 - c_i)\{1 + \exp[-a_i(\theta - b_i)]\}^{-1} \tag{4}
\]

with guessing parameter $c_i$ is chosen, the structure of the information function remains 
identical to the one in (3). The only change is the replacement of the factor $P_i(\theta)Q_i(\theta)$ in (3) 
by 

\[
\frac{(1 - c_i)\,P_i(\theta)}{c_i + (1 - c_i)\,P_i(\theta)}\;P_i(\theta)\,Q_i(\theta). \tag{5}
\]

(In all the expressions, $P_i(\theta)$ and $Q_i(\theta)$ still denote values obtained under the 2-PL model.) 
Note that (5) is generally smaller than the factor $P_i(\theta)Q_i(\theta)$ in (3) but that equality is obtained 
if $c_i \to 0$. It can therefore be concluded that (5) has a smaller effect on the value of the 
information function than the factor $P_i(\theta)Q_i(\theta)$ in (3) and that the value of the discrimination 
parameter $a_i$ remains the critical factor. 
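
The structure of the information function discussed above can be sketched in code. The function names are hypothetical, the 3-PL information is written in the standard form with the 2-PL response probability as a building block, and `select_max_info` is only a minimal illustration of the maximum-information criterion applied to estimated item parameters.

```python
import math


def p_2pl(theta, a, b):
    """2-PL response probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def info_2pl(theta, a, b):
    """Fisher information under the 2-PL: a^2 * P * Q."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def info_3pl(theta, a, b, c):
    """Fisher information under the 3-PL, written with the 2-PL P and Q.

    The factor multiplying a^2 is (1 - c) P^2 Q / (c + (1 - c) P),
    which reduces to P * Q when c = 0.
    """
    p = p_2pl(theta, a, b)
    p_star = c + (1.0 - c) * p  # 3-PL response probability
    return a * a * (1.0 - c) * p * p * (1.0 - p) / p_star


def select_max_info(theta_hat, pool, used):
    """Index of the unused item with maximum 2-PL information at theta_hat.

    pool is a list of (a, b) estimates; used is a set of indices.
    """
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: info_2pl(theta_hat, *pool[i]))


if __name__ == "__main__":
    # A well-targeted item with modest a versus an off-target item with large a.
    pool = [(0.8, 0.0), (1.5, 0.3), (1.0, 0.0)]
    print(select_max_info(0.0, pool, set()))
```

In this toy pool the item with the largest discrimination is selected even though its difficulty is off target, which is the sense in which the criterion is driven by the (estimated) discrimination rather than by the match in difficulty.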

A Bayesian criterion for item selection in CAT is the criterion of minimum expected 
posterior variance. An approximate version of the criterion for use in CAT was introduced by 
Owen (1975). In the criterion it is assumed that the ability estimation starts with a prior 
distribution for $\theta$ which is updated after each item response using Bayes' theorem. The next 
item is selected to have a predicted posterior distribution with minimum variance among all 
items. For a more detailed description of this criterion, see van der Linden (1998). 

To present the criterion more formally, let $(u_1, \ldots, u_{k-1})$ be the responses obtained on the 
first $k-1$ items in the CAT. If item $i$ is selected, the expected posterior variance is 

\[
\sum_{j=0}^{1} p_i(U_i = j \mid u_1, \ldots, u_{k-1})\,\mathrm{Var}(\theta \mid u_1, \ldots, u_{k-1}, U_i = j), \tag{6}
\]

where $\mathrm{Var}(\theta \mid u_1, \ldots, u_{k-1})$ is the posterior variance of $\theta$ and 

\[
p_i(U_i = j \mid u_1, \ldots, u_{k-1}) = \int P_i(U_i = j \mid \theta)\,g(\theta \mid u_1, \ldots, u_{k-1})\,d\theta \tag{7}
\]

is the posterior predictive probability of response $U_i$ on item $i$ given the responses $u_1, \ldots, u_{k-1}$ to 
the previous items. The next item is selected to have a minimal value for (6) among the items 
in the pool. 
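
As a rough numerical illustration of the expected posterior variance and the posterior predictive probability just described, the sketch below replaces the integrals by sums over a fixed grid. This is not Owen's approximation; the grid, the N(0,1) prior, the 2-PL response function, and all function names are assumptions.

```python
import math

# Quadrature grid for theta; an assumption made for this sketch.
GRID = [-4.0 + 0.1 * k for k in range(81)]


def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def posterior(responses):
    """Normalized posterior weights on GRID under a N(0,1) prior.

    responses is a list of (a, b, u) with u in {0, 1}.
    """
    weights = []
    for t in GRID:
        dens = math.exp(-0.5 * t * t)
        for a, b, u in responses:
            p = p_2pl(t, a, b)
            dens *= p if u == 1 else 1.0 - p
        weights.append(dens)
    total = sum(weights)
    return [w / total for w in weights]


def posterior_variance(weights):
    mean = sum(w * t for w, t in zip(weights, GRID))
    return sum(w * (t - mean) ** 2 for w, t in zip(weights, GRID))


def expected_posterior_variance(responses, a, b):
    """Predictive-probability-weighted posterior variance for a candidate item."""
    current = posterior(responses)
    value = 0.0
    for u in (0, 1):
        # Posterior predictive probability of response u on the candidate.
        pred = sum(
            w * (p_2pl(t, a, b) if u == 1 else 1.0 - p_2pl(t, a, b))
            for w, t in zip(current, GRID)
        )
        updated = posterior(responses + [(a, b, u)])
        value += pred * posterior_variance(updated)
    return value


if __name__ == "__main__":
    for a in (0.5, 1.0, 1.5):
        print(a, round(expected_posterior_variance([], a, 0.0), 4))
```

The candidate item with the smallest returned value would be administered next. Because the candidate's discrimination enters through the estimated response function, an overestimated discrimination makes the predicted variance reduction look larger than it is.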

A variation of the criterion in (6) is the maximum expected posterior-weighted 
information criterion. The criterion also predicts the probabilities of responses $U_i = 1$ and $U_i = 0$ 
for each item $i$ in the pool but uses these probabilities to calculate the expected 
posterior-weighted information: 

\[
\sum_{j=0}^{1} p_i(U_i = j \mid u_1, \ldots, u_{k-1}) \int I_{u_1, \ldots, u_{k-1}, U_i = j}(\theta)\,g(\theta \mid u_1, \ldots, u_{k-1}, U_i = j)\,d\theta, \tag{8}
\]

where $g(\theta \mid \cdot)$ is the posterior density of $\theta$ after $k-1$ items have been selected. 

The critical difference between the maximum-information criterion in (3) and the 
maximum expected posterior-weighted information in (8) is the role of the posterior 
distribution of $\theta$. In (3) the information function is evaluated close to the center of the posterior 
distribution of $\theta$ whereas in (8) the information function is integrated over the full posterior. It 
is expected that the two criteria show different behavior at the beginning of the test, where (3) 
has a preference for information functions that peak at the center of the posterior, but that 
differences disappear as the posterior itself becomes peaked later in the test. 
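
The contrast between evaluating information at a point and integrating it over the posterior can be sketched numerically. The grid, the N(0,1) prior, the restriction to the candidate item's own 2-PL information, and the function names are assumptions of this sketch. Under the 2-PL the information of an item does not depend on the response, so the predictive-probability-weighted sum equals the information averaged over the current posterior; the code computes both so the two can be compared.

```python
import math

GRID = [-4.0 + 0.1 * k for k in range(81)]  # assumed quadrature grid


def prob(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def info(theta, a, b):
    p = prob(theta, a, b)
    return a * a * p * (1.0 - p)


def posterior(responses):
    """Normalized posterior weights on GRID under a N(0,1) prior."""
    weights = []
    for t in GRID:
        dens = math.exp(-0.5 * t * t)
        for a, b, u in responses:
            p = prob(t, a, b)
            dens *= p if u == 1 else 1.0 - p
        weights.append(dens)
    total = sum(weights)
    return [w / total for w in weights]


def weighted_info(responses, a, b):
    """Candidate information averaged over the current posterior."""
    return sum(w * info(t, a, b) for w, t in zip(posterior(responses), GRID))


def expected_weighted_info(responses, a, b):
    """Sum over both responses of predictive probability times the
    candidate information averaged over the updated posterior."""
    current = posterior(responses)
    value = 0.0
    for u in (0, 1):
        pred = sum(
            w * (prob(t, a, b) if u == 1 else 1.0 - prob(t, a, b))
            for w, t in zip(current, GRID)
        )
        updated = posterior(responses + [(a, b, u)])
        value += pred * sum(w * info(t, a, b) for w, t in zip(updated, GRID))
    return value


if __name__ == "__main__":
    # At the start of the test the posterior is the wide N(0,1) prior.
    point_ratio = info(0.0, 2.0, 0.0) / info(0.0, 1.0, 0.0)
    full_ratio = expected_weighted_info([], 2.0, 0.0) / expected_weighted_info([], 1.0, 0.0)
    print(round(point_ratio, 3), round(full_ratio, 3))
```

With a wide posterior, the advantage of a highly discriminating item is smaller under the integrated criterion than under the point evaluation, because a peaked information function covers only part of the posterior.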

Simulation Study 

To further explore the role of capitalization on error in CAT, a simulation study was 
conducted. The effects of the following factors were studied: 

1. The size of the calibration sample (N=500, 1500, 2500, ∞); 

2. The length of the test (n=10, 20, 40); 

3. The size of the item pool (k=40, 80, 400, 1200); 

4. The nature of the item selection criterion (maximum information; minimum 
expected posterior variance; maximum expected posterior-weighted 
information). 

In all cases, ability was estimated using the expected a posteriori (EAP) estimator with an 
N(0,1) prior. For the maximum information criterion, ability was also estimated using the 
weighted maximum likelihood (WML) estimator derived in Warm (1989). The latter is 
attractive because of its negligible bias. 
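
For reference, an EAP estimate with a N(0,1) prior can be sketched as a ratio of two sums over a grid. The grid limits, the number of points, and the function name `eap` are assumptions; the study's actual numerical procedure is not described in the text, and the WML estimator is not shown.

```python
import math


def eap(responses, n_points=81, lo=-4.0, hi=4.0):
    """Expected a posteriori ability estimate under a N(0,1) prior.

    responses is a list of (a, b, u) tuples: 2-PL item parameters
    (in practice their estimates) and the scored response u in {0, 1}.
    """
    step = (hi - lo) / (n_points - 1)
    numerator = 0.0
    denominator = 0.0
    for k in range(n_points):
        theta = lo + k * step
        weight = math.exp(-0.5 * theta * theta)  # unnormalized N(0,1) density
        for a, b, u in responses:
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            weight *= p if u == 1 else 1.0 - p
        numerator += theta * weight
        denominator += weight
    return numerator / denominator


if __name__ == "__main__":
    print(eap([]))                      # no data: the prior mean
    print(eap([(1.0, 0.0, 1)] * 10))    # ten correct responses: pulled toward 0 by the prior
```

The pull toward the prior mean is what produces larger errors at extreme abilities in short tests.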

Method 

A calibrated pool of items was simulated as follows. A data matrix with 1,000 
examinees by 100 items was available from a Dutch national school-leaving exam of English 
as a foreign language. The items were calibrated under the 2-PL model in (2) using the 
method of marginal maximum likelihood estimation with an N(0,1) distribution for the ability 
parameter in the model. In addition, the information matrix for the item parameters was 
estimated from the data. To simulate calibration samples of different sizes, the required 
numbers of examinees were drawn from the data matrix at random and with replacement. As 
the information matrix is additive in the examinees, it could easily be adapted to the various 
samples of examinees. 

The true parameter values were equated to the values estimated from the data matrix; 
their distributions are displayed in Table 1. The distribution of the values for the item 
difficulty parameter had a mean of .970, for an ability distribution with mean and standard 
deviation normed at .00 and 1.0, respectively. Thus, the item pool was relatively difficult for 
the examinees. 

[Table 1 about here] 

Item calibration errors were drawn from normal distributions using the information 
matrix to calculate their variances. To simulate calibrated pools with larger numbers of items, 
the set of true values of the item parameters was duplicated and independent draws from the 
error distributions were made. 
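
One way to mimic this step is sketched below. It is a simplification rather than the study's procedure: the per-examinee information matrix is computed analytically for the 2-PL with abilities treated as known and distributed N(0,1) (the study estimated the matrix from the data under marginal maximum likelihood), the errors in the two parameters are drawn independently, and all function names are hypothetical. The additivity of the information matrix appears as the factor `n_examinees` in the variances.

```python
import math
import random


def item_information_matrix(a, b, n_points=81):
    """Per-examinee information for (a, b) under the 2-PL, theta ~ N(0,1).

    Returns (I_aa, I_ab, I_bb), computed by quadrature over [-4, 4].
    """
    i_aa = i_ab = i_bb = total = 0.0
    for k in range(n_points):
        theta = -4.0 + 8.0 * k / (n_points - 1)
        weight = math.exp(-0.5 * theta * theta)
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        pq = p * (1.0 - p)
        i_aa += weight * (theta - b) ** 2 * pq
        i_ab += weight * (-a * (theta - b)) * pq
        i_bb += weight * a * a * pq
        total += weight
    return i_aa / total, i_ab / total, i_bb / total


def error_sds(a, b, n_examinees):
    """Standard errors of (a, b) from the inverse of n_examinees times
    the per-examinee information matrix."""
    i_aa, i_ab, i_bb = item_information_matrix(a, b)
    det = i_aa * i_bb - i_ab * i_ab
    var_a = i_bb / (det * n_examinees)
    var_b = i_aa / (det * n_examinees)
    return math.sqrt(var_a), math.sqrt(var_b)


def draw_estimates(true_items, n_examinees, rng):
    """Add independent normal calibration errors to the true (a, b) values."""
    estimates = []
    for a, b in true_items:
        sd_a, sd_b = error_sds(a, b, n_examinees)
        estimates.append((a + rng.gauss(0.0, sd_a), b + rng.gauss(0.0, sd_b)))
    return estimates


if __name__ == "__main__":
    rng = random.Random(0)
    true_items = [(0.8, 1.0), (1.2, 0.0)]
    for n in (500, 1500, 2500):
        print(n, draw_estimates(true_items, n, rng))
```

Duplicating `true_items` and calling `draw_estimates` again with the same generator gives a larger pool with the same true values but independent errors.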

Each of the item pools in this study had 1,200 simulated items. In one part of the study 
the item pool consisted of a mixture of items calibrated using different sample sizes; one third 
of the items was simulated to be calibrated on a sample of 500 examinees, one third on a 
sample of 1500 examinees, and one third on a sample of 2500 examinees. These sections of the 
pool thus had identical distributions of their true parameter values but differed in the size of 
their calibration errors. The presence of capitalization on calibration errors was examined by 
counting the numbers of times items from the three sections were used in the adaptive tests. 

In the second part of the study, the item pools were homogeneous with respect to the 
size of the calibration sample. These pools were used to assess the effect of item calibration 
error on the final ability estimates in the adaptive procedures. 

The adaptive testing procedure was replicated 100 times for $\theta$ = -2.0, -1.0, 0.0, 1.0, and 
2.0 to obtain stable estimates of the counts and mean absolute errors. 
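
The overall procedure can be summarized in a scaled-down sketch. It is not the study's code: the pool is smaller, the true parameter distributions and the error standard deviations (taken as 4 divided by the square root of N) are invented for illustration, only the maximum-information criterion with EAP estimation is shown, and all names are hypothetical. Items are selected and scored on their estimates, while responses are generated from the true values.

```python
import math
import random


def prob(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def info(theta, a, b):
    p = prob(theta, a, b)
    return a * a * p * (1.0 - p)


def eap(responses):
    """EAP estimate with a N(0,1) prior on a fixed grid."""
    num = den = 0.0
    for k in range(81):
        t = -4.0 + 0.1 * k
        w = math.exp(-0.5 * t * t)
        for a, b, u in responses:
            p = prob(t, a, b)
            w *= p if u == 1 else 1.0 - p
        num += t * w
        den += w
    return num / den


def make_pool(rng, per_section=100, sizes=(500, 1500, 2500)):
    """Sections share true values; the error SD shrinks as 1/sqrt(N) (assumed)."""
    true_items = [(max(0.3, rng.gauss(0.8, 0.3)), rng.gauss(1.0, 0.9))
                  for _ in range(per_section)]
    pool = []
    for n in sizes:
        sd = 4.0 / math.sqrt(n)  # hypothetical error SD for both parameters
        for a, b in true_items:
            est = (max(0.05, a + rng.gauss(0.0, sd)), b + rng.gauss(0.0, sd))
            pool.append({"true": (a, b), "est": est, "n": n})
    return pool


def run_cat(pool, theta, test_length, rng):
    """One adaptive test; returns the final estimate and counts per section."""
    responses, used, counts = [], set(), {}
    theta_hat = 0.0
    for _ in range(test_length):
        best = max((i for i in range(len(pool)) if i not in used),
                   key=lambda i: info(theta_hat, *pool[i]["est"]))
        used.add(best)
        a_true, b_true = pool[best]["true"]
        u = 1 if rng.random() < prob(theta, a_true, b_true) else 0
        a_est, b_est = pool[best]["est"]
        responses.append((a_est, b_est, u))
        theta_hat = eap(responses)
        counts[pool[best]["n"]] = counts.get(pool[best]["n"], 0) + 1
    return theta_hat, counts


def simulate(theta, test_length=20, n_rep=100, seed=7):
    """Selection counts per section and mean absolute error at one theta."""
    rng = random.Random(seed)
    pool = make_pool(rng)
    totals = {500: 0, 1500: 0, 2500: 0}
    abs_error = 0.0
    for _ in range(n_rep):
        theta_hat, counts = run_cat(pool, theta, test_length, rng)
        for n, c in counts.items():
            totals[n] += c
        abs_error += abs(theta_hat - theta)
    return totals, abs_error / n_rep


if __name__ == "__main__":
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(theta, simulate(theta))
```

For each ability value the three counts sum to the number of replications times the test length, and a surplus of selections from the section with the smallest calibration sample would indicate capitalization on calibration error.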

Results 

Figures 1-3 display the counts of the numbers of items selected in the adaptive procedure from 
the sections in the item pool calibrated on samples of N=500, 1500, and 2500 examinees as a 
function of $\theta$. In each panel, the curves always sum to 100n (that is, the number of replications 
times test length). The dominant impression from the figures is that the smaller the size of the 
calibration sample, the larger the number of items selected. A surprisingly strong effect was 
present for the maximum expected posterior-weighted information criterion in combination 
with tests of n=10 items. However, an exception was obtained for the maximum-information 
criterion and WML ability estimation for n=10; an explanation for this anomaly could not be 
found. The effect showed a tendency to decline for tests with 40 items but was still present at 
this test length, in particular at the high end of the ability scale. 

[Figures 1-3 about here] 

Though not reported in these figures, the values of the discrimination parameters, $a_i$, 
for the items selected were broken down into sets of items with $a_i < .7$ and $a_i > .7$. This 
distinction roughly corresponds to items with discrimination values below and above the 
average value for the items in the pool (see Table 1). However, for nearly all $\theta$ values and 
item-selection criteria, items with values for $a_i$ in the lower category were never chosen. The 
only exceptions were a few cases with low $\theta$ values for the maximum-information criterion. 
These results remind us of an experience well known in the practice of adaptive testing: Due to 
the presence of low-discriminating items, the effective size of the item pool is generally much 
smaller than the number of items present in the pool. 

In Figures 4-6, each curve represents the mean absolute error in the ability 
estimates as a function of $\theta$ for the item pools calibrated on samples with sizes of N=500, 
1500, and 2500 examinees, the mixture of these sample sizes used above, and the true 
parameter values (N=∞). For n=10, the U-shaped curves typical of a short adaptive test with a 
prior for the ability parameter located at $\theta = 0$ were obtained. For n=20 and 40 the curves 
became flatter, with the curves for the Bayesian item-selection criteria tending to be lower and 
flatter than those for the maximum-information criterion. Though the four criteria showed 
different degrees of capitalization on calibration error in Figures 1-3, the curves in Figures 4-6 
were more homogeneous. The most conspicuous property of the latter, however, was the much 
larger variation in the mean absolute error between the different calibration samples at the 
higher part of the ability scale. Also, at this part of the scale, the size of the mean absolute errors 
was inversely related to the size of the calibration sample. This result is due to the larger supply 
of difficult items in the pool (see Table 1). As a consequence, the item-selection ratio at this part of 
the scale is considerably smaller, and the tendency to capitalize on item parameter estimation 
errors much stronger. 

[Figures 4-6 about here] 

The effect of the item-selection ratio is also shown in Figure 7. For an item pool 
with size k=40, that is, a large item-selection ratio, capitalization on calibration errors was not 
expected to occur. For this pool size, the curves in Figure 7 showed a mean absolute error in 
the ability estimates that was high at the lower end of the scale but smaller at the higher end. 
This shape reflects the fact that the majority of the items were relatively difficult. When the 
size of the item pool increased, and thus the item-selection ratio decreased, the curves for the 
smaller calibration samples deteriorated at the higher end of the scale whereas the curve for 
the true parameter values further improved. This increase in differences between the curves 
for the various sample sizes at the high end of the scale across the four panels in this figure 
is therefore expected to be due to capitalization on calibration error. 

[Figure 7 about here] 

Conclusion 

The general picture emerging from this example is that capitalization on calibration 
error does occur in adaptive testing and that its most important determinant is the item-selection 
ratio. Item pools and test lengths of various sizes were used to study the effects of this ratio 
on the ability estimates. However, due to the fact that the item pools were generated from an 
empirical data set, difficult items were overrepresented, the result being an actual item- 
selection ratio smaller than expected at the higher end of the ability scale. 

This unexpected result showed that the composition of the item pool is an important 
factor interacting with the effect of capitalization on errors in the item parameters on the 
errors in the ability estimates. Large numbers of items for certain $\theta$ values - intuitively an 
attractive feature of an item pool - are not a desideratum if the calibration sample is small. 

Table 1 

Distribution of true parameter values in simulated item pool 

Parameter   Mean    Minimum   Maximum   Standard Deviation
$a_i$       .777    .222      1.841     .288
$b_i$       .970    -1.262    3.590     .885

Figure Captions 

Figure 1. Numbers of items in the adaptive tests selected from the sections in the pool 
calibrated on N=500, 1500, and 2500 examinees for the various item-selection 
criteria (n=10). 

Figure 2. Numbers of items in the adaptive tests selected from the sections in the pool 
calibrated on N=500, 1500, and 2500 examinees for the various item-selection 
criteria (n=20). 

Figure 3. Numbers of items in the adaptive tests selected from the sections in the pool 
calibrated on N=500, 1500, and 2500 examinees for the various item-selection 
criteria (n=40). 

Figure 4. Mean absolute error in the ability estimates for item pools calibrated on N=500 
(solid curve), 1500 (dashed curve), 2500 (dotted curve), a mixture of these sample 
sizes (bold curve), and N=∞ examinees (grey curve) for the four item-selection 
criteria (n=10). 

Figure 5. Mean absolute error in the ability estimates for item pools calibrated on N=500 
(solid curve), 1500 (dashed curve), 2500 (dotted curve), a mixture of these sample 
sizes (bold curve), and N=∞ examinees (grey curve) for the four item-selection 
criteria (n=20). 

Figure 6. Mean absolute error in the ability estimates for item pools calibrated on N=500 
(solid curve), 1500 (dashed curve), 2500 (dotted curve), a mixture of these sample 
sizes (bold curve), and N=∞ examinees (grey curve) for the four item-selection 
criteria (n=40). 

Figure 7. Mean absolute error in the ability estimates for item pools calibrated on N=500 
(solid curve), 1500 (dashed curve), 2500 (dotted curve), a mixture of these sample 
sizes (bold curve), and N=∞ examinees (grey curve) for pool sizes of k=40, 80, 400, 
and 1200 items (maximum-information criterion with weighted maximum likelihood 
estimation of ability; n=20). 